[SPARK-55787][SQL] Add is_struct_empty and is_struct_non_empty built-in functions #55314

Open
Kino1994 wants to merge 1 commit into apache:master from Kino1994:feature/SPARK-55787

What changes were proposed in this pull request?

This PR introduces two new built-in SQL functions for detecting structs where all fields are null:

  • is_struct_empty(struct): returns true if the struct is non-null and all of its fields are null, false if the struct is non-null and at least one field is non-null, and null if the struct itself is null.
  • is_struct_non_empty(struct): returns true if the struct is non-null and at least one field is non-null, false if the struct is non-null and all of its fields are null, and null if the struct itself is null.

Both functions perform a shallow check only: nested structs that are themselves non-null (even if all their children are null) count as non-null at the parent level.
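The semantics above can be modeled in a few lines of plain Python. This is only an illustrative sketch, not the Catalyst implementation: a dict stands in for a struct, None stands in for SQL NULL, and the function names simply mirror the proposed SQL names.

```python
from typing import Optional


def is_struct_empty(struct: Optional[dict]) -> Optional[bool]:
    """Model of is_struct_empty: a dict stands in for a struct, None for NULL."""
    if struct is None:
        return None  # NULL struct -> NULL
    # Shallow check: only the top-level field values are inspected.
    # Note: all() over zero fields is True; how the real function treats a
    # zero-field struct is not stated in the description, so that is an assumption here.
    return all(value is None for value in struct.values())


def is_struct_non_empty(struct: Optional[dict]) -> Optional[bool]:
    """Model of is_struct_non_empty, with the same NULL handling."""
    if struct is None:
        return None
    return any(value is not None for value in struct.values())


print(is_struct_empty(None))                     # None
print(is_struct_empty({"a": None, "b": None}))   # True
print(is_struct_empty({"a": 1, "b": None}))      # False
# A nested struct that is itself non-null counts as a non-null field,
# even when all of its children are null.
print(is_struct_empty({"inner": {"x": None}}))   # False
```

The last call illustrates the shallow check: the inner dict is a non-null value at the parent level, so the parent is not considered empty.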

The implementation includes:

  • Expression classes (IsStructEmpty, IsStructNonEmpty) in complexTypeCreator.scala with full codegen support: an unrolled AND/OR chain for narrow structs (≤8 fields) and a loop with early exit for wider structs.
  • Registration in FunctionRegistry.
  • Scala API methods in functions.scala.
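The two codegen strategies can be pictured with a small generator that emits Java-like source text. This is a hypothetical sketch for illustration only: the helper name `gen_all_null_check`, the `allNull` variable, and the zero-field branch are assumptions, and the code the PR actually generates may look different. Only the 8-field threshold and the `isNullAt` accessor on rows come from the description and Spark itself.

```python
UNROLL_THRESHOLD = 8  # from the PR description: structs with <= 8 fields are unrolled


def gen_all_null_check(row_var: str, num_fields: int) -> str:
    """Return Java-like source text that computes `allNull` for a struct row variable."""
    if num_fields == 0:
        # Assumption: a zero-field struct is treated as trivially all-null.
        return "boolean allNull = true;"
    if num_fields <= UNROLL_THRESHOLD:
        # Narrow struct: one unrolled chain; Java's && short-circuits
        # at the first non-null field.
        chain = " && ".join(f"{row_var}.isNullAt({i})" for i in range(num_fields))
        return f"boolean allNull = {chain};"
    # Wide struct: a loop with an early exit keeps the generated method small.
    return (
        "boolean allNull = true;\n"
        f"for (int i = 0; i < {num_fields}; i++) {{\n"
        f"  if (!{row_var}.isNullAt(i)) {{ allNull = false; break; }}\n"
        "}"
    )


print(gen_all_null_check("row", 3))
print(gen_all_null_check("row", 12))
```

Unrolling avoids loop overhead for small structs, while the loop form bounds the size of the generated code for wide schemas.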

Why are the changes needed?

In Structured Streaming pipelines with Kafka + Avro deserialization, permissive failure handlers (e.g. PermissiveRecordExceptionHandler) replace malformed records with structs that are non-null but have all fields set to null. These "empty" structs can propagate downstream and cause runtime failures in sinks like Apache Kudu when struct fields map to primary keys that cannot be NULL.

Currently the only schema-agnostic workaround is to_json(col) != "{}", which:

  • Forces full JSON serialization per row (heavy object allocation and GC pressure).
  • Relies on magic string comparison (not type-aware).
  • Breaks whole-stage codegen optimization.
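To make the contrast concrete, here is a plain-Python model of the two approaches. It is a sketch under stated assumptions: `to_json_like` is a hypothetical stand-in that drops null top-level fields before serializing (the workaround relies on an all-null struct serializing to "{}"), and dicts stand in for structs.

```python
import json


def to_json_like(struct: dict) -> str:
    """Hypothetical stand-in for to_json: null top-level fields are omitted."""
    return json.dumps({k: v for k, v in struct.items() if v is not None})


def non_empty_via_json(struct: dict) -> bool:
    # Workaround: serialize the whole struct, then compare against a magic string.
    return to_json_like(struct) != "{}"


def non_empty_direct(struct: dict) -> bool:
    # Direct check: stop at the first non-null field, no serialization.
    return any(v is not None for v in struct.values())


samples = [
    {"id": None, "name": None},
    {"id": 7, "name": None},
    {"inner": {"x": None}},
]
for s in samples:
    # Both approaches agree on every sample; only the cost differs.
    assert non_empty_via_json(s) == non_empty_direct(s)
    print(s, non_empty_direct(s))
```

Both checks give the same answer on these inputs, but the JSON path builds a string for every row while the direct path only inspects null flags.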

The proposed functions replace this pattern with an efficient, schema-agnostic null-check that requires zero serialization, supports short-circuit evaluation, and participates fully in whole-stage codegen.

Does this PR introduce any user-facing change?

Yes. Two new built-in SQL functions are added:

-- Filter out malformed records (all-null structs) from Kafka stream
SELECT * FROM kafka_messages
WHERE value IS NOT NULL AND is_struct_non_empty(value)

Previously, users had to rely on to_json:

SELECT * FROM kafka_messages
WHERE value IS NOT NULL AND to_json(value) != '{}'

How was this patch tested?

  • Unit tests in ComplexTypeSuite covering: non-null structs, all-null structs, null struct input, mixed nulls, single-field structs, zero-field (empty schema) structs, nested structs, structs with array/map fields, type validation, and nullable semantics.
  • Integration tests in DataFrameComplexTypeSuite simulating the Avro permissive handler use case end-to-end (both DataFrame API and SQL syntax), including equivalence check against the to_json workaround.
  • A dedicated wide-struct test (12 fields) to exercise the loop codegen path.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.6)

…in functions

Add two new Catalyst expressions for efficiently detecting structs where
all fields are null, without resorting to serialization (e.g. to_json).

This addresses a common pain point in Structured Streaming pipelines
using Kafka + Avro (via ABRiS), where PermissiveRecordExceptionHandler
produces non-null structs with all-null fields on deserialization failure.

The current workaround `to_json(col) =!= "{}"` forces full JSON
serialization per row. The new functions operate directly on the
InternalRow null bitmap (a single bitwise AND per field on UnsafeRow),
achieving zero allocations and short-circuit evaluation.
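The null-bitmap check can be sketched as follows. This is an illustrative Python model, not Spark code: it assumes an UnsafeRow-style layout in which each field has one null bit, packed into 64-bit words, and the function names are hypothetical.

```python
from typing import List


def is_null_at(null_words: List[int], index: int) -> bool:
    """Test the null bit for field `index` in a bitmap of 64-bit words."""
    word = null_words[index >> 6]              # which 64-bit word holds the bit
    return (word & (1 << (index & 63))) != 0   # one bitwise AND per field


def all_fields_null(null_words: List[int], num_fields: int) -> bool:
    """Return True only if every field's null bit is set; exits at the first non-null field."""
    for i in range(num_fields):
        if not is_null_at(null_words, i):
            return False  # short-circuit
    return True


print(all_fields_null([0b11], 2))  # True: both null bits set
print(all_fields_null([0b01], 2))  # False: field 1 is non-null
```

No field values are read or copied here; only the bitmap is consulted, which is why the check needs no allocations.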

Semantics:
- is_struct_empty(NULL) -> NULL
- is_struct_empty(struct(null, null)) -> TRUE
- is_struct_empty(struct(1, null)) -> FALSE
- is_struct_non_empty is the logical complement for non-null input (NULL input still yields NULL)

Implementation details:
- Whole-stage codegen with two strategies: unrolled AND/OR chain for
  narrow structs (<=8 fields), loop with break for wide structs
- Type checking via ExpectsInputTypes with StructType
- Shallow check only (nested non-null structs count as non-null)
- nullIntolerant=true for optimizer IsNotNull inference

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>